…and Likelihoods!
February 7, 2024
Suppose I have a population of size \(N\) and a categorical variable \(Y\) (e.g. \(Y\) could be genotype with levels {aa, Aa, AA}).
For proportions, we really need a categorical variable with only two levels, which would be “Success” and “Failure”.
For example, with the genotype case, we could be interested in the proportion of heterozygotes in a population, so we would have Success = Aa and Failure = {aa, AA}.
Given the categorical variable with levels “Success” and “Failure”, the proportion of successes in the population would be denoted by
\[p = \frac{\mathrm{Number \ of \ successes \ in \ population}}{\mathrm{Total \ population \ size}} = \frac{X}{N}\]
If we have a sample of size \(n\), then we would have a sample estimate of this proportion \(p\) given by
\[\hat{p} = \frac{\mathrm{Number \ of \ successes \ in \ sample}}{\mathrm{Total \ sample \ size}} = \frac{\hat{X}}{n}\]
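As a minimal sketch of the sample estimate, assuming a made-up sample of 10 genotypes (the values below are hypothetical and only illustrate the arithmetic), we can code heterozygotes as successes and divide by the sample size:

```r
# Hypothetical sample of n = 10 genotypes (invented for illustration only)
genotypes <- c("Aa", "AA", "aa", "Aa", "AA", "Aa", "aa", "AA", "Aa", "AA")

# Collapse the three levels into Success (Aa) and Failure (aa or AA)
is_success <- genotypes == "Aa"

n <- length(genotypes)   # total sample size
x_hat <- sum(is_success) # number of successes in the sample
p_hat <- x_hat / n       # sample estimate of the proportion
p_hat
```

Because `is_success` is a logical vector, `mean(is_success)` would give the same estimate in one step.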
Proportion of Earth's surface area that is water
Example data with 7 tosses: W W L L W L W
This process is known as a binomial process, and the distribution of the number of waters expected in \(n\) tosses is given by the binomial distribution.
Definition: The binomial distribution provides the probability distribution for the number of “successes” in a fixed number of independent trials, when the probability of success is the same in each trial.
Properties:
If we have \(n\) trials, and the probability of success in each trial is \(p\), we have \[\mathrm{Pr}[x \ \mathrm{successes}] = \left(\begin{array}{c} n \\ x \end{array}\right)p^{x}(1-p)^{n-x},\] where \[\left(\begin{array}{c} n \\ x \end{array}\right) = \frac{n!}{x!(n-x)!},\]
and \(n! = n\times(n-1)\times(n-2)\cdots 2\times 1.\)
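The formula can be checked directly in R. This is a sketch: the helper name `binom_prob` is ours, and \(n=7\), \(p=0.7\) are just the values used elsewhere in these slides. It compares the hand-written formula with R's built-in `dbinom`.

```r
# Binomial probability written out from the formula
# (binom_prob is an illustrative helper name, not a built-in R function)
binom_prob <- function(x, n, p) {
  choose(n, x) * p^x * (1 - p)^(n - x)
}

n <- 7
p <- 0.7

by_hand  <- binom_prob(0:n, n, p)            # formula, for x = 0, 1, ..., n
built_in <- dbinom(0:n, size = n, prob = p)  # R's built-in version

all.equal(by_hand, built_in)  # the two should agree
sum(by_hand)                  # probabilities over all x should sum to 1
```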
To figure out Pr[\(x\) successes], first ask
Question: “What are all different outcomes of \(x\) successes in \(n\) trials?”
Example: Suppose \(n=3\) and \(x=2\).
\[2 \ \mathrm{successes} = \{SSF, SFS, FSS\}\]
\[\mathrm{Pr}[SSF] = \mathrm{Pr}[S]\times \mathrm{Pr}[S]\times \mathrm{Pr}[F] = p^2(1-p)\]
\[\mathrm{Pr}[SFS] = \mathrm{Pr}[S]\times \mathrm{Pr}[F]\times \mathrm{Pr}[S] = p^2(1-p)\]
\[\mathrm{Pr}[FSS] = \mathrm{Pr}[F]\times \mathrm{Pr}[S]\times \mathrm{Pr}[S] = p^2(1-p)\]
\[\mathrm{Pr}[SSF] = \mathrm{Pr}[SFS] = \mathrm{Pr}[FSS] = p^2(1-p) = p^x(1-p)^{n-x}\]
How many ways are there to have 2 successes in 3 trials? “3 choose 2”, which equals 3.
\[\mathrm{Pr}[2 \ \mathrm{successes}] = \left(\begin{array}{c} 3 \\ 2 \end{array}\right)p^2(1-p) = 3p^2(1-p)\]
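To see the counting argument concretely, one option (a sketch, not the only way to do it) is to enumerate all \(2^3\) sequences of S and F and keep those with exactly 2 successes:

```r
n <- 3
x <- 2

# All 2^n possible sequences of S (success) and F (failure), one per row
outcomes <- expand.grid(rep(list(c("S", "F")), n), stringsAsFactors = FALSE)

# Count the successes in each sequence and keep those with exactly x
n_success <- rowSums(outcomes == "S")
matching  <- outcomes[n_success == x, ]

# Collapse each row into a string such as "SSF"
sequences <- unname(apply(matching, 1, paste, collapse = ""))
sequences

# The number of matching sequences should equal "n choose x"
length(sequences) == choose(n, x)
```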
To get values of the probability distribution, use the dbinom function. Supposing \(n=7\) and \(p=0.7\), evaluating dbinom(0:7, size=7, prob=0.7) gives:
[1] 0.0002187 0.0035721 0.0250047 0.0972405 0.2268945 0.3176523 0.2470629
[8] 0.0823543
The d in dbinom stands for density; for a discrete distribution like the binomial, this is the probability of each value.
Question: Given \(p=0.3\) and \(n=20\), what is Pr[6 successes]? (Write out using notation)
Answer: \[\mathrm{Pr}[6 \ \mathrm{successes}] = \left(\begin{array}{c} 20 \\ 6 \end{array}\right)0.3^{6}\times 0.7^{14}.\]
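The same answer can be evaluated numerically. As a quick sketch, the formula written out by hand and the `dbinom` call should give the same number (roughly 0.19):

```r
n <- 20
x <- 6
p <- 0.3

# The expression from the answer, written out term by term
by_formula <- choose(n, x) * p^x * (1 - p)^(n - x)

# The same probability from the built-in function
by_dbinom <- dbinom(x, size = n, prob = p)

by_formula
by_dbinom
```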
Let’s plot the distribution:
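The original figure is not reproduced here. One possible way to draw it in base R, assuming we plot the \(n=20\), \(p=0.3\) case from the question above (the choice of parameters and of `barplot` are our assumptions), is:

```r
n <- 20
p <- 0.3

x_values <- 0:n                              # all possible numbers of successes
probs <- dbinom(x_values, size = n, prob = p)  # probability of each

# One bar per possible number of successes
barplot(probs,
        names.arg = x_values,
        xlab = "Number of successes",
        ylab = "Probability",
        main = "Binomial distribution (n = 20, p = 0.3)")
```

The tallest bar sits near \(np = 6\), which is where we expect most of the probability to be.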
Question: How do I generate data from a probability distribution?
Random draw from a distribution: Use rbinom.
Convert to “raw” data.
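A minimal sketch of both steps, assuming the globe-tossing setup with \(n=7\) tosses and \(p=0.7\) (the seed and the variable names are arbitrary choices of ours):

```r
set.seed(2024)  # arbitrary seed, only so the draw is reproducible

n <- 7
p <- 0.7

# Random draw: the number of waters in n tosses
x <- rbinom(1, size = n, prob = p)

# Convert to "raw" data: x waters (W) and n - x lands (L), in random order
raw <- sample(c(rep("W", x), rep("L", n - x)))

x
raw
```

The count `x` carries the same information about \(p\) as the raw sequence, since under the binomial model the order of the tosses does not matter.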
import {aq, op} from "@uwdata/arquero" // JavaScript dplyr
import {vl} from "@vega/vega-lite-api-v5" // JavaScript ggplot2
ss = require('simple-statistics')
viewof N_par = Inputs.range(
  [1, 100],
  {value: 7, step: 1, label: "n"}
)
viewof X_par = Inputs.range(
  [0, N_par],
  {value: 4, step: 1, label: "x"}
)
\[ \mathrm{L}(\color{blue}{p}) = \left(\begin{array}{c}\color{red}{n} \\ \color{red}{x}\end{array}\right)\color{blue}{p}^{\color{red}{x}}(1-\color{blue}{p})^{\color{red}{n}-\color{red}{x}} \]
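In the likelihood, the data (\(n\) and \(x\)) are held fixed and \(p\) varies. As a sketch in R, assuming the globe data \(n=7\), \(x=4\) (the function name `likelihood` and the grid of candidate values are our choices), we can evaluate \(L(p)\) over many values of \(p\) and see where it peaks:

```r
n <- 7
x <- 4

# Likelihood of p for fixed data (n, x); note that p is the argument
likelihood <- function(p) dbinom(x, size = n, prob = p)

# Evaluate on a fine set of candidate values of p
p_values <- seq(0, 1, by = 0.01)
L_values <- likelihood(p_values)

# Candidate value with the highest likelihood
p_best <- p_values[which.max(L_values)]
p_best

# A numerical optimizer gives a similar answer without a fixed grid
p_opt <- optimize(likelihood, interval = c(0, 1), maximum = TRUE)$maximum
p_opt

plot(p_values, L_values, type = "l", xlab = "p", ylab = "L(p)")
```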
Question: What is the most likely proportion given the data?
Answer: Maximize the likelihood!
Definition: The maximum likelihood estimate is the parameter value \(\hat{p}\) that maximizes the likelihood function. It satisfies \[ \frac{dL}{dp}\bigl(\hat{p}\bigr) = 0, \]
which for the binomial likelihood gives \[ \hat{p} = \frac{x}{n} \]
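One way to see where \(\hat{p} = x/n\) comes from, sketched under the assumption that \(0 < x < n\) so the maximum lies strictly inside \((0,1)\): maximizing \(L\) is equivalent to maximizing \(\ln L\), which is easier to differentiate.
\[ \ln L(p) = \ln\left(\begin{array}{c} n \\ x \end{array}\right) + x\ln p + (n-x)\ln(1-p) \]
\[ \frac{d \ln L}{dp} = \frac{x}{p} - \frac{n-x}{1-p} \]
Setting the derivative to zero at \(p = \hat{p}\):
\[ \frac{x}{\hat{p}} = \frac{n-x}{1-\hat{p}} \;\Longrightarrow\; x(1-\hat{p}) = (n-x)\hat{p} \;\Longrightarrow\; x = n\hat{p} \;\Longrightarrow\; \hat{p} = \frac{x}{n} \]
For the globe data with \(x=4\) waters in \(n=7\) tosses, this gives \(\hat{p} = 4/7 \approx 0.57\).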
We’ll work through slides 8-22 of McElreath’s slides.
McElreath Slides 23-26
We could interpret the likelihood as a probability distribution over the parameter (as long as we scale it so that it sums to 1)!
Same as garden of forking paths!
What do we do when new data arrives?
Bayesian Updating!!
McElreath Slides 27-31
McElreath Slides 32-43
\[ P(p\,|\,\mathrm{data}) = \frac{P(\mathrm{data}\,|\,p)P(p)}{P(\mathrm{data})} \]
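The denominator \(P(\mathrm{data})\) is what makes the posterior sum to 1. As a sketch, assuming \(p\) is restricted to a finite grid of candidate values \(p_1, \ldots, p_m\) (the index notation is introduced here for illustration), it is the sum of likelihood times prior over every candidate:
\[ P(\mathrm{data}) = \sum_{i=1}^{m} P(\mathrm{data}\,|\,p_i)\,P(p_i) \]
so that
\[ P(p_j\,|\,\mathrm{data}) = \frac{P(\mathrm{data}\,|\,p_j)\,P(p_j)}{\sum_{i=1}^{m} P(\mathrm{data}\,|\,p_i)\,P(p_i)} \]
For a continuous parameter, the sum becomes an integral over \(p\).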
Data: Suppose we tossed 7 times and got 4 water.
p_grid <- seq(from=0, to=1, length.out=20) # define grid
prior <- rep(1, 20) # define prior
likelihood <- dbinom(4, size=7, prob=p_grid) # compute likelihood at each value in grid
unstd.posterior <- likelihood * prior # compute product of likelihood and prior
posterior <- unstd.posterior / sum(unstd.posterior) # standardize the posterior, so it sums to 1
\[ P[p|\mathrm{data}] = \frac{L(\mathrm{data}|p)P[p]}{P[\mathrm{data}]} \]
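Bayesian updating can be sketched on the same grid: after each toss, the current posterior becomes the prior for the next toss. The code below is an illustration under our own assumptions (the helper name `update_posterior` is hypothetical, and the toss sequence is the 7-toss example data W W L L W L W). It checks that updating one toss at a time gives the same posterior as using all 7 tosses at once.

```r
p_grid <- seq(from = 0, to = 1, length.out = 20)  # same grid as above

# One updating step: multiply the current prior by the likelihood of a
# single toss, then standardize (update_posterior is an illustrative helper)
update_posterior <- function(prior, toss) {
  likelihood <- if (toss == "W") p_grid else 1 - p_grid
  unstd <- likelihood * prior
  unstd / sum(unstd)
}

tosses <- c("W", "W", "L", "L", "W", "L", "W")  # example data: 4 W in 7 tosses

# Start from a flat prior and update after every toss
posterior_seq <- rep(1, length(p_grid)) / length(p_grid)
for (toss in tosses) {
  posterior_seq <- update_posterior(posterior_seq, toss)
}

# All data at once, as in the grid approximation
posterior_batch <- dbinom(4, size = 7, prob = p_grid) * rep(1, length(p_grid))
posterior_batch <- posterior_batch / sum(posterior_batch)

all.equal(posterior_seq, posterior_batch)  # the two routes should agree
```

The binomial coefficient in `dbinom` does not depend on \(p\), so it cancels when the posterior is standardized, which is why the two routes match.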
Intro to Quantitative Biology, Spring 2024